MLPDES25: Transformers are universal in-context learners

So yeah, I have a difficult task, because I am going to give a short introductory talk, and there will be a later talk, I think by Borjan and Domènec, that will maybe speak about similar topics. We will also speak about optimal transport, so it is a kind of patchwork of several ideas, but the key message is how to use techniques from PDEs, and the modeling of tokens as distributions, in order to gain some understanding of very deep transformers. This was initially introduced in the PhD thesis of Michael Sander (I forgot to put his picture here), and then I will mention two contributions: one with Takashi and Maarten on expressivity, and one with Valérie and Pierre, I would say, on smoothness and the PDE side.

Just to explain very briefly what attention mechanisms are, and I think Borjan will probably go over this as well: this is basically the heart of most recent architectures, and the key idea, I would say, is that you start by tokenizing the data. This is an example for text, but you have vision transformers for images, transformers for proteins, and so on; transformers are really used everywhere and are basically replacing traditional neural networks.

So you tokenize the data: you split it into groups of letters, and each group of letters is then encoded as a vector, to which you also add information about the position. So each token models some group of letters in the text plus its position in the text. You can do the same for images: you chunk the image into small patches.
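
As a rough illustration of this step (a hypothetical character-level tokenizer and a sinusoidal positional encoding, with made-up vocabulary and dimensions; not the tokenizer used in practice):

```python
import numpy as np

# Hypothetical tokenizer: split the text into fixed-size groups of letters
# and map each distinct group to an integer id.
def tokenize(text, group_size=3):
    groups = [text[i:i + group_size] for i in range(0, len(text), group_size)]
    vocab = {g: k for k, g in enumerate(sorted(set(groups)))}
    return [vocab[g] for g in groups]

def embed_with_positions(token_ids, embed_dim=16, seed=0):
    rng = np.random.default_rng(seed)
    # Embedding table (random here, learned in practice): one vector per id.
    table = rng.normal(size=(max(token_ids) + 1, embed_dim))
    x = table[token_ids]                      # (n_tokens, embed_dim)
    # Sinusoidal positional encoding added to each vector, so every token
    # carries both its content and its position in the sequence.
    pos = np.arange(len(token_ids))[:, None]
    freq = 1.0 / 10000 ** (np.arange(0, embed_dim, 2) / embed_dim)
    pe = np.zeros_like(x)
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return x + pe

tokens = embed_with_positions(tokenize("transformers are universal"))
print(tokens.shape)  # (n_tokens, embed_dim): the input is now a set of points
```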

So it means that now the input data is a collection of points. This is really a paradigm change, because it means that neural networks will basically operate on a possibly infinite-dimensional space: the number of tokens can be arbitrarily large, and in practice it is very, very large, and when you use the model to generate new tokens you really generate possibly infinitely long sequences. So I think this is a major paradigm shift, and the key question that I want to put forward, not really address, is how to prove theorems in infinite-dimensional spaces for neural architectures.
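
To make the "tokens as distributions" viewpoint concrete (written here as a sketch of the standard formulation, not a formula taken from the slides): a sequence of n tokens can be summarized by an empirical probability measure, so that a transformer layer becomes a map on the space of probability measures, well defined for any number of tokens.

```latex
% n tokens x_1, ..., x_n in R^d are encoded as the empirical measure
\mu \;=\; \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i} \;\in\; \mathcal{P}(\mathbb{R}^d),
% and one layer of the network is viewed as a map on probability measures,
\Gamma \;:\; \mathcal{P}(\mathbb{R}^d) \;\longrightarrow\; \mathcal{P}(\mathbb{R}^d),
% which makes sense independently of n, i.e. on an infinite-dimensional space.
```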

So how does it go? It basically alternates a single set of layers, which are always the same, and there are three parts in a transformer. The first one is the normalization layer, which I will not speak about; I think Borjan will probably do it. It means that you project your tokens onto the sphere. This is very important, because attention is very sensitive to the norm of the tokens, but in my talk I will basically not speak about this.
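
As a minimal sketch of this projection step (plain Euclidean normalization onto the unit sphere; real architectures use variants such as LayerNorm or RMSNorm with learned scales):

```python
import numpy as np

def normalize(X, eps=1e-6):
    # X has shape (n_tokens, d): project each token onto the unit sphere,
    # x -> x / ||x||, since attention is very sensitive to token norms.
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    return X / (norms + eps)
```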

Then there is the attention mechanism, which is the big novelty, so I will spend a lot of time speaking about this. But what is very important is that you also have traditional MLPs, small neural networks that operate on tokens independently: each token is processed independently. Of course this is very old, so I will not speak about it, but in practice it is also very important. In fact, when you look at DeepSeek, for instance, most of the parameters are in the MLPs; the MLPs are the ones that carry the largest number of parameters. If you remove them it does not work, but of course it is just plain old MLPs, so I will not speak about them a lot.
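
A minimal sketch of such a token-wise MLP (a two-layer network with arbitrary made-up weights, applied to every token independently):

```python
import numpy as np

def mlp(X, W1, b1, W2, b2):
    # X has shape (n_tokens, d); the same two-layer network is applied to
    # each row separately, so tokens do not interact in this part.
    H = np.maximum(X @ W1 + b1, 0.0)   # hidden layer with ReLU
    return H @ W2 + b2                 # map back to dimension d

# Usage with random (untrained) weights, just to show the shapes.
rng = np.random.default_rng(0)
d, hidden = 16, 64
X = rng.normal(size=(9, d))
Y = mlp(X,
        rng.normal(size=(d, hidden)), np.zeros(hidden),
        rng.normal(size=(hidden, d)), np.zeros(d))
print(Y.shape)  # (9, 16): one output vector per token
```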

So what is attention, at a high level? It is a mechanism that takes each token and moves it to a new location, but this move is conditioned on its neighbors.
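
As a minimal sketch of this mechanism (standard single-head softmax self-attention; the matrices Wq, Wk, Wv stand in for learned parameters, and no causal mask or multi-head structure is included):

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)   # for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # X has shape (n_tokens, d). Each token is moved to a weighted average of
    # the (transformed) tokens; the weights depend on how well its query
    # matches the other tokens' keys, i.e. the move is conditioned on neighbors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (n_tokens, n_tokens) weights
    return A @ V                                  # new location of each token

# Usage with random (untrained) parameter matrices.
rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(9, d))
Y = attention(X, rng.normal(size=(d, d)),
                 rng.normal(size=(d, d)),
                 rng.normal(size=(d, d)))
print(Y.shape)  # (9, 16): same tokens, displaced according to their neighbors
```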

Presenter
Prof. Gabriel Peyré

Access
Open access

Duration
00:28:24 min

Recording date
2025-04-29

Uploaded on
2025-04-29 16:13:30

Language
en-US

#MLPDES25 Machine Learning and PDEs Workshop 
Mon. – Wed. April 28 – 30, 2025
HOST: FAU MoD, Research Center for Mathematics of Data at FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen – Bavaria (Germany)
 
SPEAKERS 
• Paola Antonietti. Politecnico di Milano
• Alessandro Coclite. Politecnico di Bari
• Fariba Fahroo. Air Force Office of Scientific Research
• Giovanni Fantuzzi. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Borjan Geshkovski. Inria, Sorbonne Université
• Paola Goatin. Inria, Sophia-Antipolis
• Shi Jin. SJTU, Shanghai Jiao Tong University
• Alexander Keimer. Universität Rostock
• Felix J. Knutson. Air Force Office of Scientific Research
• Anne Koelewijn. FAU MoD, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Günter Leugering. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Lorenzo Liverani. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Camilla Nobili. University of Surrey
• Gianluca Orlando. Politecnico di Bari
• Michele Palladino. Università degli Studi dell’Aquila
• Gabriel Peyré. CNRS, ENS-PSL
• Alessio Porretta. Università di Roma Tor Vergata
• Francesco Regazzoni. Politecnico di Milano
• Domènec Ruiz-Balet. Université Paris Dauphine
• Daniel Tenbrinck. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Daniela Tonon. Università di Padova
• Juncheng Wei. Chinese University of Hong Kong
• Yaoyu Zhang. Shanghai Jiao Tong University
• Wei Zhu. Georgia Institute of Technology
 
SCIENTIFIC COMMITTEE 
• Giuseppe Maria Coclite. Politecnico di Bari
• Enrique Zuazua. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
 
ORGANIZING COMMITTEE 
• Darlis Bracho Tudares. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Nicola De Nitti. Università di Pisa
• Lorenzo Liverani. FAU DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
 
Video teaser of the #MLPDES25 Workshop: https://youtu.be/4sJPBkXYw3M
 
 
#FAU #FAUMoD #MLPDES25 #workshop #erlangen #bavaria #germany #deutschland #mathematics #research #machinelearning #neuralnetworks

Tags

Erlangen, mathematics, Neural Network, PDE, Applied Mathematics, FAU MoD, Partial Differential Equations, Bavaria, Machine Learning, FAU MoD workshop, FAU